SUMMARY
=======

This is a quick introduction to the Lightweight Data Pipeline (LWDP), a set
of sh/bash functions for:

- file dependency checking,
- NFS-safe file locking,
- injection of "manual override" files.


USAGE
=====

To use LWDP, simply put the following near the top of your processing script:

source LWDPDIR/lwdp.sh

where "LWDPDIR" is the directory containing your copy of lwdp.sh.


DEPENDENCY CHECKING
===================

Dependency checking and conditional file updating is done as follows:

if LWDP_needs_update target source1 source2 source3; then
   cat source1 source2 source3 > target
fi

This determines, by comparing file time stamps, whether file "target" is
older than any of the source files, "source1" through "source3". An arbitrary
number of source files can be given. If any one of them is newer than
"target", or if "target" does not exist, the check succeeds and the body of
the "if" statement (re-)creates the target.
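
The time stamp comparison described above can be sketched roughly as follows.
This is only an illustration, not the code in lwdp.sh: the function name
"needs_update_sketch" is hypothetical, and the sketch relies on the "-nt"
test operator, a widely supported extension (bash, dash, ksh) that is not part
of POSIX. It also ignores compressed files.

```shell
# Hypothetical sketch; NOT the actual LWDP_needs_update implementation.
# Returns 0 (true) if the target is missing or older than any source.
needs_update_sketch() {
    target=$1
    shift
    # A missing target always needs to be created.
    [ -e "$target" ] || return 0
    for src in "$@"; do
        # -nt is true when $src was modified more recently than $target.
        if [ "$src" -nt "$target" ]; then
            return 0
        fi
    done
    # The target exists and no source is newer: nothing to do.
    return 1
}
```

It would be called the same way as the real function, i.e., with the target
first and the sources after it.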


SAFE PARALLEL PROCESSING
========================

If you want to run the same script simultaneously on multiple CPUs (or
multiple machines, e.g., in a cluster), use the following instead:

if LWDP_needs_update_and_lock target source1 source2 source3; then
   cat source1 source2 source3 > target
   LWDP_lockfile_delete target
fi

This ensures that only one copy of the script works on any given target file
at a time. Note the call to LWDP_lockfile_delete, which releases the lock once
the target has been written.

For NFS-safe file locking, it is advisable to have the "lockfile" tool from
the "procmail" package installed on all machines using the script. If
"lockfile" cannot be found, a built-in fallback is used. Unfortunately, the
fallback is NOT 100% safe, because sh does not provide atomic file system
operations.

IMPORTANT: "lockfile" must exist and be in the binary search path on ALL
machines running the same script, or on NONE of them. Locking via "lockfile"
and the built-in fallback locking are NOT compatible with each other.
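
One way to check the ALL-or-NONE requirement is to run a small test on every
machine and compare the results. The following is a minimal sketch: the
helper "lock_mechanism_sketch" is hypothetical and not part of LWDP, and it
assumes that a "lockfile" found in the search path is the procmail tool.

```shell
# Hypothetical helper; NOT part of LWDP.
# Prints "lockfile" if a lockfile tool is in the search path, otherwise
# "fallback".
lock_mechanism_sketch() {
    if command -v lockfile >/dev/null 2>&1; then
        echo "lockfile"
    else
        echo "fallback"
    fi
}

# Report the result together with the host name, so that the output of
# several machines can be compared side by side.
echo "$(uname -n): $(lock_mechanism_sketch)"
```

If the reported mechanism differs between machines, the script should not be
run on them simultaneously.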


MANUAL OVERRIDE FILES
=====================

To inject manual override files into the processing stream, two functions are
defined:

- LWDP_get_override_file
- LWDP_get_override_file_list

The first looks up an optional override for a single file; the second does
the same for each file in a list.

What do I mean by override?

Say you have a file that is generated by some processing, but the processing
sometimes fails and gives an unusable file. As an example, the processing
could be segmentation of an image into multiple regions, but in some cases,
the segmentation algorithm fails. In this case, it may be desirable to use a
manually-corrected file instead in any further processing.

This can be done as follows:


file_to_use=/some/path/to/file.sfx
file_to_use=$(LWDP_get_override_file "${file_to_use}")


This will test whether a file

/some/path/to/manual/file.sfx

exists and return this path if it does. If the override file does not exist,
the original path is returned.

The "LWDP_get_override_file_list" function does the same thing, but for each
file in a list of files.
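
The path mapping in the example above can be sketched as follows. This is an
illustration only, not the code in lwdp.sh: the function name
"override_path_sketch" is hypothetical, the location of the override (a
"manual" subdirectory next to the original file) is inferred from the example
paths, and compressed files are ignored.

```shell
# Hypothetical sketch; NOT the actual LWDP_get_override_file implementation.
# Prints DIR/manual/NAME if that file exists, otherwise the original path.
override_path_sketch() {
    original=$1
    # Build the candidate path: same file name, inside a "manual" subdirectory.
    candidate="$(dirname "$original")/manual/$(basename "$original")"
    if [ -e "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        printf '%s\n' "$original"
    fi
}
```

Because the original path is printed when no override exists, the result can
always be used directly in further processing.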


TRANSPARENT FILE COMPRESSION
============================

Virtually all functions of LWDP treat compressed files (with suffixes .gz,
.bz2, .xz, and .Z) transparently, which means:

1. in any dependency check, if a source file does not exist, LWDP will check
   for a compressed version and, if one exists, will use that instead.

2. in any dependency check, a compressed version of the target file is treated
   exactly like an uncompressed one.

3. when checking for manual override files, a compressed override file is
   used if an uncompressed one does not exist. In this case, the returned
   path is the full path including the compression suffix. If no override
   file exists but the original file is compressed, the returned path also
   contains the compression suffix.
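
The lookup of compressed variants described in this section can be sketched
as follows. This is an illustration only, not the code in lwdp.sh: the
function name "resolve_compressed_sketch" is hypothetical, and the order in
which the suffixes are tried is an assumption.

```shell
# Hypothetical sketch; NOT the actual LWDP lookup code.
# Prints the first existing variant of a path: the plain file first, then
# the .gz, .bz2, .xz, and .Z versions (this order is an assumption).
# If none exists, prints the plain path and returns 1.
resolve_compressed_sketch() {
    path=$1
    for suffix in "" .gz .bz2 .xz .Z; do
        if [ -e "${path}${suffix}" ]; then
            printf '%s\n' "${path}${suffix}"
            return 0
        fi
    done
    printf '%s\n' "$path"
    return 1
}
```

Trying the empty suffix first means that an uncompressed file always takes
precedence over a compressed one, in line with item 3 above.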